tors. The discriminators try to distinguish the “real” from the “fake,” and the generator
tries to make the discriminators unable to work well. The result is a rectified process and a
unique architecture with a more precise estimation of the full precision model. Pruning is also explored within the GAN framework to improve the applicability of the 1-bit model in practical applications. To accomplish this, we integrate quantization and pruning into a unified framework.
3.6.1 Loss Function
The rectification process combines full precision kernels and feature maps to rectify the binarization process. It includes kernel approximation and adversarial learning. This learnable kernel approximation leads to a unique architecture with a precise estimation of the convolutional filters by minimizing the kernel loss. Discriminators $D(\cdot)$ with filters $Y$ are introduced to distinguish feature maps $R$ of the full precision model from those $T$ of RBCN. The RBCN generator with filters $W$ and matrices $C$ is trained with $Y$ using knowledge of the supervised feature maps $R$. In summary, $W$, $C$ and $Y$ are learned by solving the following optimization problem:
\[
\arg\min_{W,\hat{W},C}\,\max_{Y}\ \mathcal{L} = L_{Adv}(W,\hat{W},C,Y) + L_{S}(W,\hat{W},C) + L_{Kernel}(W,\hat{W},C),
\tag{3.62}
\]
where $L_{Adv}(W,\hat{W},C,Y)$ is the adversarial loss, defined as
\[
L_{Adv}(W,\hat{W},C,Y) = \log(D(R;Y)) + \log(1 - D(T;Y)),
\tag{3.63}
\]
where $D(\cdot)$ consists of a series of basic blocks, each containing linear and LeakyReLU layers. We also use multiple discriminators to rectify the binarization training process.
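As a concrete illustration, a discriminator of this form might be sketched as below. This is a minimal PyTorch sketch rather than the authors' implementation; the layer widths, the number of blocks, and the final sigmoid score are our assumptions.

```python
# Minimal sketch of a feature-map discriminator D(.; Y) built from basic blocks
# of Linear + LeakyReLU layers (widths, depth, and sigmoid output are assumptions).
import torch
import torch.nn as nn

class FeatureMapDiscriminator(nn.Module):
    def __init__(self, in_features, hidden=256, num_blocks=2):
        super().__init__()
        layers, dim = [], in_features
        for _ in range(num_blocks):
            layers += [nn.Linear(dim, hidden), nn.LeakyReLU(0.2)]
            dim = hidden
        layers += [nn.Linear(dim, 1), nn.Sigmoid()]  # probability that the input is "real"
        self.net = nn.Sequential(*layers)

    def forward(self, feature_map):
        # Flatten an (N, C, H, W) feature map before the linear blocks
        return self.net(feature_map.flatten(start_dim=1))
```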
In addition, $L_{Kernel}(W,\hat{W},C)$ denotes the kernel loss between the learned full precision filters $W$ and the binarized filters $\hat{W}$ and is defined as:
\[
L_{Kernel}(W,\hat{W},C) = \frac{\lambda_1}{2}\,\|W - C\hat{W}\|^2,
\tag{3.64}
\]
where $\lambda_1$ is a balance parameter. Finally, $L_S$ is a traditional problem-dependent loss, such as the softmax loss. The adversarial, kernel, and softmax losses act as regularizations on $\mathcal{L}$.
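Under the common assumption that $\hat{W} = \mathrm{sign}(W)$ and that $C$ acts as a learnable per-filter scale broadcast over each kernel, the kernel loss of Eq. 3.64 could be computed roughly as follows (a sketch, not the reference implementation):

```python
# Sketch of the kernel loss in Eq. 3.64: (lambda_1 / 2) * ||W - C * W_hat||^2.
# Assumes W_hat = sign(W) and a per-filter scale C broadcast over each kernel.
import torch

def kernel_loss(W, C, lambda1=1e-4):
    # W: full-precision filters, e.g. shape (out_c, in_c, k, k)
    # C: learnable scale, e.g. shape (out_c, 1, 1, 1)
    W_hat = torch.sign(W)                      # binarized filters
    return 0.5 * lambda1 * ((W - C * W_hat) ** 2).sum()
```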
For simplicity, the update of the discriminators is omitted in the following description until Algorithm 13. We also omit $\log(\cdot)$ and rewrite the optimization in Eq. 3.62 as Eq. 3.65:
\[
\min_{W,\hat{W},C}\ L_{S}(W,\hat{W},C) + \frac{\lambda_1}{2}\sum_{l}\sum_{i}\|W_i^l - C^l\hat{W}_i^l\|^2 + \sum_{l}\sum_{i}\|1 - D(T_i^l;Y)\|^2,
\tag{3.65}
\]
where $i$ represents the $i$th channel and $l$ the $l$th layer. In Eq. 3.65, the objective is to obtain $W$, $\hat{W}$ and $C$ with $Y$ fixed, which is why the term $D(R;Y)$ of Eq. 3.63 can be ignored. The update process for $Y$ is found in Algorithm 13. The advantage of our formulation in Eq. 3.65 is that the loss function is trainable, which means it can easily be incorporated into existing learning frameworks.
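A minimal sketch of this objective with $Y$ fixed is given below; the names (`task_loss`, the per-layer lists, `lambda1`) are illustrative assumptions, and the per-channel sums of Eq. 3.65 are folded into tensor-wide sums.

```python
# Sketch of the generator-side objective of Eq. 3.65 with the discriminators Y fixed:
# task loss + kernel loss + ||1 - D(T)||^2 rectification term, accumulated over layers.
import torch

def rbcn_generator_loss(task_loss, W_list, C_list, D_list, T_list, lambda1=1e-4):
    loss = task_loss                                                 # L_S(W, W_hat, C)
    for W, C, D, T in zip(W_list, C_list, D_list, T_list):
        W_hat = torch.sign(W)                                        # binarized filters
        loss = loss + 0.5 * lambda1 * ((W - C * W_hat) ** 2).sum()   # kernel term
        loss = loss + ((1.0 - D(T)) ** 2).sum()                      # adversarial term
    return loss
```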
3.6.2 Learning RBCNs
In RBCNs, convolution is implemented using $W^l$, $C^l$ and $F_{in}^l$ to calculate the output feature maps $F_{out}^l$ as
\[
F_{out}^{l} = \mathrm{RBConv}(F_{in}^{l};\hat{W}^{l},C^{l}) = \mathrm{Conv}(F_{in}^{l},\hat{W}^{l}\odot C^{l}),
\tag{3.66}
\]
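A rough sketch of such an RBConv layer is shown below, assuming $\hat{W}^l = \mathrm{sign}(W^l)$ and a channel-wise scale $C^l$, and omitting the straight-through estimator needed to backpropagate through the sign function.

```python
# Sketch of RBConv in Eq. 3.66: an ordinary convolution of F_in with the binarized
# filters W_hat scaled element-wise by C (here a per-filter scale, by assumption).
import torch
import torch.nn.functional as F

def rbconv(F_in, W, C, stride=1, padding=1):
    # F_in: (N, in_c, H, W) input feature maps; W: (out_c, in_c, k, k) filters
    # C: (out_c, 1, 1, 1) learnable scale applied to each binarized filter
    W_hat = torch.sign(W)                              # W_hat = sign(W)
    return F.conv2d(F_in, W_hat * C, stride=stride, padding=padding)
```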